Abstract: The contribution of this paper is to provide a
semantic model (using soft constraints) of the words used by
web-users to describe objects in a language game; a game in
which one user describes a selected object among those composing
the scene (see figure 1), and another user has to guess which
object has been described. The given description needs to be
non-ambiguous and accurate enough to allow other users to
guess the described shape correctly.

To build these semantic models the descriptions need to be
analyzed to extract the syntax and word classes used (see [1]
for details). We have modeled the meaning of these descriptions
using soft constraints as a way of grounding the meaning.

The descriptions generated by the system took into account 
the context of the object to avoid ambiguous descriptions, and 
allowed users to guess the described object correctly 72% of
the time.

I. Introduction 

Language can be seen as a system learnt and used by 
humans for communicating and learning, which covers a 
wide range of their daily activities. It is a social phenomenon 
resulting in an evolving system of great complexity. Lan- 
guage is inextricably linked to human capability to converse, 
learn, reason and make decisions in an environment of 
imprecision, uncertainty and lack of information. It is viewed 
here as a complex reality to be represented step by step, in 
an incremental fashion. 

In that respect, the most relevant feature of language is its
"meaning" (its "use" according to Wittgenstein [2]), and that
includes not only the meaning/use of isolated words, but
also the meaning/use of expressions as a whole. In general,
the meaning/use of the words integrating an expression is
only grasped in relation with the other words and within the
meaning/use of the expression as a whole in a context [3].

This work is part of an ongoing project called Smart-Bees,
which aims to study how machines can learn and communicate
in human-like ways, from a Computing with Words, Actions and
Perceptions (CW-AP) perspective [4], [5]. Users share a common
environment and play different "language-games". In this case
they share a blackboard with geometric shapes of different
colors, sizes and positions, and play guessing and describing
games. In the describing game, given an image with one selected
object, users try to describe that object to other users in
a non-ambiguous way (see figure 1). In the guessing game,
on the contrary, given a description and an image, users try
to guess which object was described.

Sergio Guadarrama and David P. Pancho are with the European Centre for
Soft Computing, Mieres (Asturias), Spain (phone: +34 985456545; email:
{sergio.guadarrama, david.perez}@softcomputing.es).



This project has its roots in Wittgenstein's ideas about
Meaning and Language [2], [6], Zadeh's ideas on Linguistic
Variables, Computing with Words and Generalized Constraints
[7], [8], [9], [10], [11], Trillas' ideas on Words and
Fuzzy Sets [12], [13], Roy's ideas on meaning grounding
[14], [15], [16], [17], [18], and Guadarrama's works on
Computing with Words, Actions and Perceptions [4], [19],
[20].

To learn semantic models from descriptions of shapes
given by users, several steps are needed: to collect
descriptions of shapes from web-users; to learn the lexicon and
syntax used in those descriptions; to link that lexicon and
syntax with the features of the shapes to learn the semantics;
to generate new descriptions using the syntax and semantics
learned; and to test them with web-users. The final goal is to
learn concepts, words and some sort of syntax and semantics,
building a model grounded in the shared perceptions.




Fig. 1 
The green circle in the front 



The contribution of this paper is to provide a semantic
model (using soft constraints) of the words used by
web-users to describe objects in a language game; a game in
which one user describes the selected object among those
composing the scene, and another user has to guess which
object has been described (see figure 1). The given
description needs to be non-ambiguous and accurate enough to allow
other users to guess the described object correctly.

So far the system has 40 registered users, from 15 different
countries, who have provided 360 descriptions using 150
different words and have allowed the system to learn some
lexicon (30 words), some syntax (20 patterns), and some
semantics (7 word classes grounded) for the shape description
task. The method proposed in this paper performs quite well,
obtaining 100% correctly spelled words, 88% syntactically
correct sentences and 72% semantically correct sentences; by
comparison, users on average spelled 97% of the words
correctly, wrote 93% syntactically correct sentences
and provided 75% semantically correct sentences.

The rest of the paper is structured as follows. In Section
II we describe related works and compare them with this
one. In Section III we present our model of the meaning of
words based on soft constraints, in Section IV we present
the learning algorithm to learn the soft constraints from the
data, and in Section V we present how the new descriptions
are generated. Finally, Section VI presents the main results
and Section VII the main conclusions of the paper.

II. Related works 

The experiment presented here was inspired by
the DESCRIBER system developed by Roy in [15], where he
addressed a similar problem of learning the descriptions
provided by one user about a scene composed of
non-overlapping squares and rectangles. Nevertheless, we have
made the experiment more realistic in several aspects:
allowing different users to provide descriptions (web-users
not familiar with the experiment), including more kinds of
shapes (triangles, circles, ovals) and allowing them to overlap
(making the segmentation and the descriptions harder). It is
also important to remark that in the previous experiment
the only user was a native English speaker, who provided
very consistent descriptions, without spelling or syntactical
errors and with very few ambiguous descriptions, while in our
experiment we have a variety of users from 15 different
countries, with only a few being non-native English speakers.

In that case the proposed system used bi-grams
for the syntax learning and Gaussian mixtures for the
semantic learning, obtaining 81.3% correct descriptions.
But they had only one user, who spelled 100% of
the words correctly, wrote 100% syntactically correct sentences
and was able to provide 89% semantically correct sentences,
whereas we start with 40 users from
15 different countries who on average spelled 97%
of the words correctly, wrote 93% syntactically correct sentences,
and were able to provide only 75% semantically correct
sentences.

The problem of learning grounded words has been studied 
in Ha, OS, ini, ED, Ga; the problem of social learning
has been studied in [23], [24]. The need to extend Fuzzy
Logic to cope with the problems of CW-AP has been recently 
remarked in i), IB], lEg), 123, ll28l . 

These previous works may be contrasted with this one in 
two aspects. First, we see language learning as an integrated 
process where sensory-action learning, social learning and 
supervised learning are interleaved and combined. Second, 
we start from a multi-user perspective, where different users 
and agents interact and share their knowledge. 



The main difference of our approach with respect to other
published works is the path taken, that is, the movement
from manipulation of measurements to manipulation of
perceptions, and from syntax-based systems to semantics-based
systems. Other approaches underestimate the importance of
the imprecision inherent in language [29] and in perception [30]
and have tried to reduce it to simple forms of uncertainty,
instead of dealing with it through a general theory of
uncertainty [31].



III. Semantic models of words using soft constraints

The meaning of a word is its use in language, and therefore
it is context-dependent. Words are grounded in
actions and perceptions, and their use is learnt in a
semi-supervised environment. Let us summarize the main
problems that have been faced in this work:

• There are several misspelled words and some rare words
that need to be corrected or discarded.

• Different users use different words and different syntactic
patterns, in some cases with different meanings.

• The same object can be described in many different
ways, depending on the context and on the intention of
the user.

• Descriptions should be understandable by other
users, so they should be truthful, precise, context-relevant
and non-ambiguous.

• There are few examples for each word, and not all the
possible combinations are seen.

• The problem is semi-supervised: only a small part of the
data can be supervised.

To collect human descriptions of shapes and to test the
results, we have set up an interactive website. The system
learns from the descriptions provided by humans and uses
the method described in this paper to produce its own
descriptions.

https://www3.softcomputing.es/smart-bees

These are some examples of simple descriptions given by 
users: "the green rectangle", "big green triangle", "brown 
rectangle". And these are examples of compound descrip- 
tions: "light green rectangle at the bottom", "pink circle 
behind dark green square", "the shape under the green one", 
"light blue circle in the middle", "orange circle behind the 
yellow circle", "green small square in the background", "the 
dark orange rectangle behind the triangle". 

In this paper we will focus on the semantic learning,
once the lexicon and syntax have been learnt (see paper
[1], also presented in this conference), and on the generation
of new descriptions and their validation. Using the results
from [1] we transformed the original problem of pairs of
descriptions and shapes into a problem of sets of pairs of
words and shapes, in which each set represents a class of
words extracted by the syntax.

Thus, each shape will be associated with all the words
used in its description, and therefore it will have multiple
labels attached. Given a set of word classes associated with
a set of objects, we need to learn when each word of the
class is used based on the features of the shapes (see Section
IV). This problem is similar to a multiple labeling problem,
in which for each object and each word class we need to
decide which labels are applicable and to which degree.

Given a set of pairs of words (taken as labels) and shapes
(taken as objects) we need to learn why, when and how each
label is used according to the features of the shapes, to the
relations between shapes and to the grammatical rules.
Calculating the degree of matching between a description and
a selected object in a scene is very important, since it will be
used later to calculate the degree of ambiguity, by comparing
it with the matching degrees between the description and the
other shapes forming the scene.

A. Modeling the Meaning of Propositions 

As Zadeh suggested in [26] and in [31], every proposition
can be represented by a generalized constraint:

"p" → X is R

where X is a relevant variable constrained by R. For example:

"John is Tall" → Height(John) is Tall

where Height(John) is a projection of some attributes of
John, and Tall is a constraint on the values of the attributes
of John.

B. Modeling the Meaning of Descriptions 

In our case, a description of an object x can be
represented by a set of constraints. For example:

• "The blue square"

Color(x) is Blue and Shape(x) is Square

• "The big dark green triangle in the background"

Color(x) is Dark Green and Shape(x) is Triangle
and Position(x) is Background

where Color, Shape and Position are projections of the
features of x, and Blue, Square, Dark Green, Triangle and
Background are constraints on the values of the projected
features.

Thus from the descriptions provided by the users and their 
corresponding images, the system learns which projections 
are associated with which words, and which constraints 
represent their meaning. 

C. Learning process 

The general phases of the learning process are listed
below:

• Learning the Lexicon: in this phase relevant words are
selected and misspelled words are filtered out (presented
in [1]).

• Learning the Syntax: in this phase words are grouped
according to their role in the sentence, and a grammar is
learnt (presented in [1], and briefly shown in
Section III-D).

• Segmentation: images are segmented to extract objects and
features, and the segmented objects are paired with
descriptions (presented in Section IV-A).

• Learning the Semantics: a model is generated for each
word belonging to the cluster in the projected space,
according to the features selected (presented
in Section IV).

• Generating sentences: in this phase syntactically and
semantically correct sentences are generated for new
images (presented in Section V).

• Evaluation of results: once all the sentences are generated,
an evaluation process is performed, in which the
users try to understand the sentences and select the
corresponding object (presented in Section VI).

Let us recall the results of the lexical and syntax learning
phases from [1]. Words with frequency lower
than 10 were filtered out, leaving 30 words which,
after clustering, formed 7 word classes (shown in Section III-D)
and generated a syntax composed of 20 patterns, shown in
Table I.
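
The frequency filter recalled above can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the function name `learn_lexicon`, the whitespace tokenization and the toy corpus are hypothetical; only the threshold of 10 occurrences comes from the text.

```python
from collections import Counter


def learn_lexicon(descriptions, min_count=10):
    """Keep the words seen at least `min_count` times over all descriptions."""
    counts = Counter(
        word
        for text in descriptions
        for word in text.upper().split()  # naive tokenization (assumption)
    )
    return {word for word, n in counts.items() if n >= min_count}


# Toy corpus invented for illustration; it is not the data of the paper.
corpus = ["the green circle"] * 12 + ["teh grren circle"] * 2
print(sorted(learn_lexicon(corpus)))  # ['CIRCLE', 'GREEN', 'THE']
```

In this sketch rare misspellings such as "teh" are simply discarded because they fall below the threshold; correcting them instead would need an additional step.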

D. Classes of words 

Class 1 = { THE, A } 
Class 2 = { BACKGROUND, FRONT } 

Class 3 = { CIRCLE, OVAL, TRIANGLE, RECTANGLE, ELLIPSE,
SQUARE }
Class 4 = { ON, IN, AT, BEHIND } 
Class 5 = { LIGHT, BIG, DARK } 
Class 6 = { TOP, BOTTOM, RIGHT, LEFT } 
Class 7 = { PINK, BLUE, GREEN, ORANGE, RED, YELLOW, 
PURPLE, VIOLET, BROWN } 
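
Given these classes, a description can be mapped to its syntax pattern, that is, to the sequence of class numbers of its words. The following Python sketch shows one way to do it; the function name `pattern_of` and the decision to return `None` for words outside the lexicon are assumptions made for illustration.

```python
# Word classes copied from the list above.
WORD_CLASSES = {
    1: {"THE", "A"},
    2: {"BACKGROUND", "FRONT"},
    3: {"CIRCLE", "OVAL", "TRIANGLE", "RECTANGLE", "ELLIPSE", "SQUARE"},
    4: {"ON", "IN", "AT", "BEHIND"},
    5: {"LIGHT", "BIG", "DARK"},
    6: {"TOP", "BOTTOM", "RIGHT", "LEFT"},
    7: {"PINK", "BLUE", "GREEN", "ORANGE", "RED", "YELLOW",
        "PURPLE", "VIOLET", "BROWN"},
}

# Inverted index: word -> class number.
CLASS_OF = {word: cls for cls, words in WORD_CLASSES.items() for word in words}


def pattern_of(description):
    """Return the class pattern of a description as a tuple of class numbers.

    Returns None when a word is not in the lexicon (hypothetical handling).
    """
    pattern = []
    for word in description.upper().split():
        if word not in CLASS_OF:
            return None
        pattern.append(CLASS_OF[word])
    return tuple(pattern)


print(pattern_of("the green circle in the front"))  # (1, 7, 3, 4, 1, 2)
```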

TABLE I
Most frequent patterns

Frequency: 18.89%, 6.94%, 6.39%, 5.83%, 3.89%, 3.33%, 3.06%, 2.50%,
2.22%, 1.66%, 1.11%, 1.11%, 0.83%, 0.83%, 0.83%, 0.83%, 0.83%, 0.83%,
0.83%, 0.83%

Pattern: 7 3; 1 7 3; 1 3; 3; 7 3 4 12; 5 7 3; 2 7 3; 17 3 4 1;
7 3 4 16; 7; 6 3; 1 7 3; 1 7; 1 5 7; 2 5 7; 34 1; 34 1; 1 7 3;
7 3 4 13; 134 166; 5734 12




IV. Learning the Semantics 

The system needs to learn why those specific words were
used to describe that object in that context (image). To do
so, we analyzed the images to segment and extract objects
and measure their features. We used scaffolded learning,
starting from simple descriptions before learning compound
descriptions; of the 360 descriptions with all their words in
the lexicon, 75% are simple and 25% are compound.

Words belonging to the same class have different
meanings; for example, given a class of words {'BLUE', 'RED',
'GREEN', 'YELLOW', ...} we assume that each word has
a different meaning, and therefore should be represented by
a different model, even though in some cases different words
can be applied to the same object to some extent.

A. Shapes' segmentation 

A fuzzy edge detector was used to find the edges of
the shapes, a filling transformation was then used to find the
regions inside the edges, and finally color-based
clustering and overlap detection were used to group the
regions into shapes. After obtaining a set of candidate shapes,
each comprising a set of pixels, they were matched with the
selected object and its corresponding description.

For each shape a set of 20 features were measured, 
including: 

• Average RGB: Red, Green, Blue. 

• Average YCbCr: Y is the luma component, and Cb and 
Cr are the blue-difference and red-difference chroma 
components. 

• Bounding Box: Coordinates of the bounding box. 

• Height and width. 

• Center of gravity: position of the center of gravity. 

• Bounding Ellipse: Orientation and size of the bounding 
ellipse. 

• Major Minor: length of the major and minor axis of the 
bounding ellipse. 

• Extension: proportion of the bounding box filled. 

• Height to width ratio. 

• Area: number of pixels. 

• Holes: proportion of holes in the object. 
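
Some of the geometric features listed above can be computed directly from the set of pixels of a segmented shape. The Python sketch below is a minimal illustration, assuming that a shape is given as a set of integer (x, y) pixel coordinates; the function name and the dictionary keys are hypothetical, and the color, ellipse and hole features are left out.

```python
def shape_features(pixels):
    """Compute a few geometric features from a set of (x, y) pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    width = x_max - x_min + 1
    height = y_max - y_min + 1
    area = len(pixels)  # number of pixels (pixels must be distinct)
    return {
        "bounding_box": (x_min, y_min, x_max, y_max),
        "width": width,
        "height": height,
        "center_of_gravity": (sum(xs) / area, sum(ys) / area),
        "area": area,
        # Extension: proportion of the bounding box that is filled.
        "extension": area / (width * height),
        "hw_ratio": height / width,
    }


# A filled rectangle, 4 pixels wide and 2 pixels high.
rect = {(x, y) for x in range(4) for y in range(2)}
print(shape_features(rect))
```

For a filled rectangle the extension is 1, while shapes such as triangles or circles fill only part of their bounding box, which suggests why extension and height-to-width ratio can help to tell shapes apart.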



B. Multi-classification problem 

It is important to notice that different users describe
the same objects differently, even using different words
and different syntax. So the training data could contain
different labels for the same object or no label at all. Some
objects have labels for only some of the word classes but
not for all; for example "The blue square" only specifies that
the color is blue and the shape is square but says nothing
about the size or position of the object described.

There are also many objects that have not been described
by any user, so we also have many unlabeled objects.

The system learns which projection (relevant features) is
appropriate for each class of words and which constraints
(relevant values) are associated with each word. For every
class of words we assume that one projection is shared by all
the words in the class. For every word in a class we assume
that it is represented by one constraint over the projection of
the class.

Fig. 2
Fuzzy Decision Tree for Class 2 (Depth)

To obtain a robust classifier despite the aforementioned
problems we have decided to use fuzzy decision trees,
both for their robustness and flexibility and because they
perform feature selection during the learning process.

C. Fuzzy decision Trees 

A different set of features could be relevant for each class
of words. So we used fuzzy decision trees [32] to classify
the objects according to their labels, and cross-validation to
prune the tree and select the most relevant features. Figure
2 shows the fuzzy decision tree of Class 2.

The features selected for each class are the following:

Class      Features
Class 1    -
Class 2    Holes, Minor
Class 3    Ext, HW-ratio
Class 4    -
Class 5    G
Class 6    X, Area
Class 7    Cr, Cb


From the features selected we can see that none is related
to Class 1 nor to Class 4, which means that with the current
features their meaning remains ungrounded (or unlearned). This
is due to the fact that those classes are more related to the
syntax than to the semantics; nevertheless, the fuzzy decision
tree learns that the most frequent word should be used by
default.

The decision trees for each class are the following:

• Class 1: If true then 'THE'

• Class 2: See figure 2

• Class 3: See figure 3

• Class 4: If true then 'IN'

• Class 5: If g < 0.64 then 'LIGHT' else 'DARK'

• Class 6: See figure 4

• Class 7: See figure 5
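
The three trees that are written out in full above are simple enough to be expressed as plain rules. The Python sketch below transcribes them; the function names and the `features` dictionary are hypothetical, and it is assumed that `g` is the average green value of the object scaled to [0, 1]. The trees for Classes 2, 3, 6 and 7 are only given as figures and are not reproduced.

```python
def class1_label(features):
    # No feature was selected for Class 1: use the most frequent word.
    return "THE"


def class4_label(features):
    # No feature was selected for Class 4: use the most frequent word.
    return "IN"


def class5_label(features):
    # Threshold 0.64 on g as stated in the list above; the scaling of g
    # to [0, 1] is an assumption of this sketch.
    return "LIGHT" if features["g"] < 0.64 else "DARK"


print(class1_label({}), class4_label({}), class5_label({"g": 0.5}))
```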




Fig. 3
Fuzzy Decision Tree for Class 3 (Shapes)

Fig. 4
Fuzzy Decision Tree for Class 6 (Positions)

Fig. 5
Fuzzy Decision Tree for Class 7 (Colors)

Fig. 6
Fuzzy Labels for Class 7 (Colors)

Fig. 7
Fuzzy Labels for Class 3 (Shapes)

D. Fuzzy Labels

Once the fuzzy decision trees are built for each word
class, we can calculate the degree of matching between every
object and every word, obtaining a soft constraint for each
label. For example, for Class 7 (colors) and Class 3
(shapes) we obtain the fuzzy labels plotted in figures 6 and
7, respectively.

E. Degree of matching of descriptions

The degree of matching between one description and one
object depends on the degree of matching of each word
composing the description and on an aggregation function
(in our case the minimum).

Once the projections (relevant features) and the soft
constraints (fuzzy labels) have been learnt for each word
class, we can transform every description into generalized
constraints using the syntax, as follows:

The            blue               square
Class 1        Class 7            Class 3
-              CrCb(x) is Blue    ExtHWratio(x) is Square
$\mu_{The}$    $\mu_{Blue}$       $\mu_{Square}$



From that we can calculate the degree of matching $\mu_M$
between an object x and a description D by:

$$\mu_M(x, D) = \min_i \, \mu_{label_i}(x), \quad \forall\, label_i \in pattern(D)$$

where pattern(D) is the sequence of labels of a given
description, and $\mu_{label_i}$ is the fuzzy label representing each
word of the description.
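
As a minimal sketch of this aggregation, the Python function below takes the membership degrees of one object for each word and returns their minimum. The function name, the dictionary representation and the choice of treating a missing word as degree 0 are assumptions; the three numbers are the degrees reported for 'The red rectangle' in Section V, and their assignment to individual words is assumed here.

```python
def degree_of_matching(memberships, words):
    """Minimum of the membership degrees of the words of a description.

    `memberships` maps each word to its degree for ONE object; a word
    without an entry is treated as not applicable (degree 0.0, assumption).
    """
    if not words:
        raise ValueError("empty description")
    return min(memberships.get(word, 0.0) for word in words)


# Degrees of the selected object for each word (assignment assumed).
red_rectangle = {"THE": 1.0, "RED": 0.68, "RECTANGLE": 0.74}
print(degree_of_matching(red_rectangle, ["THE", "RED", "RECTANGLE"]))  # 0.68
```

With the minimum as aggregation function, a description matches an object only as well as its worst-matching word.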



F. Degree of ambiguity 

The degree of ambiguity of one description in one scene
depends on the degrees of matching between the description
and the objects of the scene: if more than
one object has a high degree of matching, then the description
could refer to several objects and be ambiguous.

In every scene there are several objects, and any given
description can be ambiguous if it is applicable to several of
these objects. We can calculate the degree of ambiguity $\alpha_A$
of a description D in a scene S by:

$$\alpha_A(D, x, S) = \sup_{y \neq x,\; y \in S} \mu_M(y, D)$$

where x is the object with the highest degree of matching
and $\mu_M(y, D)$ represents the degree of matching between
the description D and the other objects y ≠ x present in
the scene. Thus the higher the degree of matching with the
other objects, the higher the degree of ambiguity, because the
description would not be discriminative enough.
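
For a scene with a finite number of objects the supremum is simply a maximum, so the definition can be sketched in Python as follows. The function name, the object names and the value 0.10 are hypothetical; 0.57 and 0.53 are the degrees reported for 'The green circle' in the figure 1 example of Section V. Returning 0 for a scene with a single object is an assumption of the sketch.

```python
def degree_of_ambiguity(matching_by_object, target):
    """Highest degree of matching among the objects other than `target`."""
    others = [mu for obj, mu in matching_by_object.items() if obj != target]
    # A scene with a single object cannot be ambiguous (assumption).
    return max(others, default=0.0)


# Degree of matching of one description with every object of a scene.
scene = {"circle_front": 0.57, "circle_back": 0.53, "square": 0.10}
target = max(scene, key=scene.get)  # object with the highest matching
print(target, degree_of_ambiguity(scene, target))  # circle_front 0.53
```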

V. Generating descriptions 

For generating descriptions the system will look for short,
truthful and non-ambiguous descriptions, and will follow the
next algorithm:

1) Given a scene with one selected object.
2) Segment it, extract the objects and their features.
3) Get the most frequent short syntax pattern.
4) For each word class find the label with the highest
degree of matching.
5) Build the description and calculate the degree of
ambiguity.
6) If the description is non-ambiguous return the
description with the highest degree of matching; else go to
step 3, look for the next pattern and repeat the
process.


Fig. 8 
The red rectangle 



For example, in the scene shown in figure 8 the system
segments it and finds 7 objects with their 20 features. Starting
with the most frequent short pattern (1 7 3), it calculates the degree
of matching for each label in Class 1, Class 3 and Class
7; it finds that The ∈ Class 1, Rectangle ∈ Class 3 and
Red ∈ Class 7 have the highest degree of matching:

D = 'The red rectangle'

$$\mu_M(x, D) = \min(1, 0.68, 0.74) = 0.68$$

$$\alpha_A(D, x, S) = \sup_{y \neq x,\; y \in S} \mu_M(y, D) = 0.11$$

Nevertheless, in the scene shown in figure 1, when the
system calculates the degree of matching starting with the most
frequent short pattern (1 7 3), it finds that the degree of
ambiguity is high:

D = 'The green circle'

$$\mu_M(x, D) = \min(\mu_{The}(x), \mu_{Green}(x), \mu_{Circle}(x)) = \min(1, 0.78, 0.57) = 0.57$$

$$\alpha_A(D, x, S) = \sup_{y \neq x,\; y \in S} \mu_M(y, D) = 0.53$$

Thus the system keeps trying other patterns until it finds
one with a lower degree of ambiguity (1 7 3 4 1 2) while
maintaining a high degree of matching, in this case:

D = 'The green circle in the front'

$$\mu_M(x, D) = \min(\mu_{The}(x), \mu_{Green}(x), \mu_{Circle}(x), \mu_{In}(x), \mu_{The}(x), \mu_{Front}(x)) = \min(1, 0.78, 0.57, 1, 1, 0.61) = 0.57$$

$$\alpha_A(D, x, S) = \sup_{y \neq x,\; y \in S} \mu_M(y, D) = 0.07$$

VI. Results 

To compare this work with the previous one [15] and to
check the influence of the different options considered in the
paper we have defined three methods:

• Method 1: In this case we used the algorithm and
features proposed in this paper but without using the
degree of ambiguity to avoid ambiguous descriptions.

• Method 2: In this case we used the algorithm and
features proposed by Roy in his paper [15].

• Method 3: In this case we used the algorithm and
features proposed in this paper and used the degree of
ambiguity to avoid ambiguous descriptions.

For the scene shown in figure 1 the descriptions generated
by the three methods are:

• Method 1: GREEN CIRCLE

• Method 2: THE LIGHT GREEN CIRCLE

• Method 3: THE GREEN CIRCLE IN THE FRONT

and for the scene shown in figure 8 they are:

• Method 1: THE RED RECTANGLE

• Method 2: THE PINK RECTANGLE

• Method 3: THE RED RECTANGLE



After generating the descriptions for the 350 scenes using
the three methods, we included them in the web page so that
users could try to guess which objects were being described.
To guarantee the fairness of the experiment, the users do not
know which descriptions are generated automatically by the
system and which ones come from other users; the
description shown to each user is selected randomly
among all of them. Counting as correct those descriptions that
other users guessed right, we obtained the results shown in
figure 9.







Fig. 9 
Ranking users according to their performance 



Figure 9 shows that Method 1 performs
below average: it obtains 49% of the descriptions correct
and is ranked #39, which means that 4 other users
perform even worse. Method 2 performs a little
better, obtaining 57% of the descriptions correct (although
below the 81.3% reported in the previous work), and
is ranked #35. Method 3 performs quite well,
obtaining 72% of the descriptions correct (just a little bit
over the average of users), and is ranked #27.

VII. Conclusions 

So far the system has 40 registered users, from 15
different countries, who have provided 360 descriptions using
150 different words and have allowed the system to learn
some lexicon (30 words), some syntax (20 patterns), and
some semantics (7 word classes grounded) for the shape
description task. The best method performs quite well,
obtaining 100% correctly spelled words, 88% syntactically
correct sentences and 72% semantically correct sentences,
despite the variety of users, who spelled 97% of
the words correctly, wrote 93% syntactically correct sentences and
provided 75% semantically correct sentences.

We have provided a semantic model (using soft
constraints) of the words used by web-users to describe objects
for other users in a describing game. The descriptions
generated took into account the context of the object to
avoid ambiguous descriptions, allowing users to guess the
described object correctly. Future work will study the
construction of complex phrases, those referring to more than
one object.

With the approach taken in this work, the possibility of
studying semantic models for specific words used by specific
users in specific contexts is opened. This can be seen as a
step in the development of Computing with Words, whose
relevance has been highlighted by Zadeh in [10], [26].